The purpose of this application is to determine the effects Covid-19 has had on the US population since the outbreak. The application can be used to determine the total deaths, the total positive cases, the percentage of people who recovered after testing positive, and the percentage of people who died after testing positive in each state over a selected period of time. Overall, by using this application, I found that about 93% of people were able to recover after testing positive for Covid-19.
In today’s world, Covid-19 needs no introduction. It started in early January of 2020 and then spread rapidly across the whole world. Like all other countries, the United States could not stop the virus from entering its borders. Since then, the virus has affected the country greatly. As a response to this ongoing global pandemic, I have developed a web-based application using R and RStudio so that users can visualize and engage with the massive amount of data being accumulated in this unprecedented time. I decided to use R because it has many packages for visualizing and manipulating data, which makes it well suited to building an interactive dashboard for analyzing the Covid-19 data. Shiny is an R package that makes it easy to build interactive web applications on top of R's statistical computation and graphics. The app provides a user-friendly interface in which the user selects a date range and a variable; Shiny uses these inputs to create a map of the United States that colors each state according to how strongly it was affected. The other packages used in the development of this app are usmap and dplyr.
The dataset that I am using is collected by The New York Times. The link to this dataset is provided below.
The dataset has 1,995,353 rows and 6 columns. Each of these columns is explained below.
date: The date stamp of each entry in the data set, starting at 2020-01-21 and running until 2021-12-08. The data type of this column is date.
county: The county in which each entry in the data set was recorded. The data type of this column is character.
state: The state in which each entry in the data set was recorded. The data type of this column is character.
fips: The FIPS code that uniquely identifies each county. The data type of this column is integer.
cases: The number of positive Covid-19 cases recorded on each day in each county. The data type of this column is integer.
deaths: The number of people who died due to Covid-19 on each day in each county. The data type of this column is integer.
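As a rough illustration of the column layout described above, the sketch below loads three illustrative rows in the same six-column format and checks their types. The rows and the object name covid are assumptions made for this example; they are not copied from the real dataset.

```r
# Three illustrative rows in the six-column layout (made-up values).
csv_text <- "date,county,state,fips,cases,deaths
2020-03-01,Alpha,Washington,53061,1,0
2020-03-02,Alpha,Washington,53061,2,0
2020-03-02,Beta,Illinois,17031,1,0"

covid <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv() loads the date column as character, so convert it explicitly.
covid$date <- as.Date(covid$date)

# Show the type of every column.
str(covid)
```

After the conversion, date is a Date column, county and state are character, and fips, cases, and deaths are integer.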
To make this data usable, I first had to clean it, and I started by removing rows with NA values using the na.omit() function. After that, I was left with 1,931,590 rows of data. Moving on to data manipulation: once the user has selected their desired date range, a subset function reduces the data set to the rows that fall within that range. The county and fips columns are then removed to make the data set easier to handle and work with. Once that is done, the order() function is applied to the state column to arrange the data in alphabetical order. I also use the unique() function on the state column to extract all the unique state names present in the data set.
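The cleaning and filtering steps above can be sketched as follows. This is a minimal sketch on a made-up data frame; the object names (covid, clean, in_range, states) and the two date variables are assumptions for illustration rather than the names used in the app.

```r
# Toy data standing in for the real dataset (values are made up).
covid <- data.frame(
  date   = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02",
                     "2020-03-02", "2020-03-05")),
  county = c("A", "B", "A", NA, "C"),
  state  = c("Texas", "Ohio", "Texas", "Ohio", "Maine"),
  fips   = c(48001L, 39001L, 48001L, NA, 23001L),
  cases  = c(5L, 2L, 7L, 1L, 3L),
  deaths = c(0L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# 1. Drop every row that contains an NA in any column.
clean <- na.omit(covid)

# 2. Keep only rows inside the user's date range (hypothetical inputs).
start_date <- as.Date("2020-03-01")
end_date   <- as.Date("2020-03-02")
in_range <- subset(clean, date >= start_date & date <= end_date)

# 3. Drop the county and fips columns.
in_range <- in_range[, c("date", "state", "cases", "deaths")]

# 4. Sort alphabetically by state and list the distinct state names.
in_range <- in_range[order(in_range$state), ]
states <- unique(in_range$state)
```

Because the data is sorted before unique() is called, the state names come out in alphabetical order as well.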
Next, I add those unique state names to the final data set. Along with that, I use for loops and if statements to calculate the total number of deaths and the total number of positive Covid-19 cases in each state. These totals are added to the final data set as they are calculated. To calculate the percentage of people who died after testing positive for Covid-19 in each state, I divide the total number of deaths in the state by its total number of positive cases and multiply the result by 100. To get the percentage of people who recovered after testing positive in each state, I subtract the percentage of people who died from 100%. All these values are added to the final data set, which is then used to produce the color-coded map that the user sees.
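The per-state totals and percentages described above can be sketched as follows. Note that this sketch swaps the for loops and if statements for base R's aggregate(), which produces the same per-state sums more compactly. The data values and the column names pct_died and pct_recovered are assumptions for illustration.

```r
# Toy filtered data: several rows per state (values are made up).
in_range <- data.frame(
  state  = c("Ohio", "Ohio", "Texas", "Texas"),
  cases  = c(40L, 60L, 150L, 50L),
  deaths = c(1L, 1L, 4L, 2L),
  stringsAsFactors = FALSE
)

# Sum cases and deaths per state; aggregate() stands in for the loops.
final <- aggregate(cbind(cases, deaths) ~ state, data = in_range, FUN = sum)

# Percentage who died after testing positive, and its complement.
final$pct_died      <- final$deaths / final$cases * 100
final$pct_recovered <- 100 - final$pct_died

print(final)
```

With these toy numbers, Ohio has 2 deaths out of 100 cases (2% died, 98% recovered) and Texas has 6 deaths out of 200 cases (3% died, 97% recovered).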
To visualize the final data set, I use the usmap package along with ggplot2. The plot_usmap() function in usmap matches the values in the selected column to each state, so I use if-else statements to pass a different column of the final data set depending on the user’s input. I then use the scale_fill_continuous() function from ggplot2 to fill in each state on the map according to its standing in the selected category.
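One way to express the column selection and the map call is sketched below. The menu labels, the column names, the helper names pick_column() and draw_state_map(), and the white-to-red color range are all assumptions for illustration; switch() is used here in place of the if-else chain. The plotting function needs the usmap and ggplot2 packages installed and is only defined here, not called.

```r
# Map each menu label (hypothetical) to a column of the final data set.
pick_column <- function(choice) {
  switch(choice,
    "Total deaths"      = "deaths",
    "Total cases"       = "cases",
    "Percent died"      = "pct_died",
    "Percent recovered" = "pct_recovered",
    stop("Unknown choice: ", choice)
  )
}

# Draw the state map; requires the usmap and ggplot2 packages.
# `final` must have a `state` column plus the value columns named above.
draw_state_map <- function(final, choice) {
  usmap::plot_usmap(regions = "states", data = final,
                    values = pick_column(choice)) +
    ggplot2::scale_fill_continuous(low = "white", high = "red", name = choice)
}

# Example call (not run here): draw_state_map(final, "Total deaths")
```

Keeping the label-to-column mapping in one small function means a new variable can be added to the menu without touching the plotting code.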
A Shiny app is built from two components, the UI and the server, which Shiny connects through reactivity. The UI object defines the placement, dimensions, and layout of the user interface, whereas the server function contains the renderPlot() call that renders the plot and performs the back-end calculations and data manipulation that are not visible to the user.
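The split between the two components might look roughly like the skeleton below. This is a sketch under stated assumptions, not the app's actual code: the input and output IDs ("dates", "variable", "map"), the panel layout, and the wrapper name make_app() are hypothetical, and a placeholder plot stands in for the map. It needs the shiny package installed.

```r
# Build a minimal Shiny app; requires the shiny package.
# Input and output IDs ("dates", "variable", "map") are hypothetical.
make_app <- function() {
  # UI: layout and input controls only, no calculations.
  ui <- shiny::fluidPage(
    shiny::titlePanel("Covid-19 in the United States"),
    shiny::sidebarLayout(
      shiny::sidebarPanel(
        shiny::dateRangeInput("dates", "Date range",
                              start = "2020-01-21", end = "2021-12-08"),
        shiny::selectInput("variable", "Variable",
                           choices = c("Total deaths", "Total cases"))
      ),
      shiny::mainPanel(shiny::plotOutput("map"))
    )
  )

  # Server: back-end work happens inside renderPlot().
  server <- function(input, output) {
    # Re-runs whenever input$dates or input$variable changes.
    output$map <- shiny::renderPlot({
      plot(1, main = paste(input$variable, "from", input$dates[1],
                           "to", input$dates[2]))  # placeholder for the map
    })
  }

  shiny::shinyApp(ui = ui, server = server)
}

# To launch interactively: shiny::runApp(make_app())
```

In the real app, the placeholder plot inside renderPlot() would be replaced by the filtering, aggregation, and map-drawing steps described earlier.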
Because the user can produce countless outputs and results with this application, I have attached some examples below. All of the examples use the same date range, 2020-02-04 to 2021-12-01.
Figure 1
In Figure 1 we can see the visualization of the total number of deaths due to Covid-19 in each state. The highest number of deaths is 20,000,000 or more over the course of almost 22 months. The states with the highest totals are Texas and California. Trailing behind these states is Florida, with about 15,000,000 deaths.
Figure 2
In Figure 2 we can see the visualization of the total number of positive Covid-19 cases in each state. The state with the highest number of positive cases over the period of almost 22 months is California, with 1,500,000,000 or more. Trailing behind is Texas, with 1,000,000,000 or more. Florida falls somewhere in the range of 500,000,000 to 1,000,000,000.
Figure 3
In Figure 3 we can see the visualization of the percentage of people who lost their lives after testing positive for Covid-19. The states with the highest percentages are Massachusetts, Connecticut, New Jersey, and Rhode Island, at 3% or above. The state with the lowest percentage is Utah, at 1% or lower.
Figure 4
In Figure 4 we can see the visualization of the percentage of people who recovered after testing positive for Covid-19. The states with the highest percentages are Utah, Wyoming, Nebraska, Wisconsin, Maine, and Vermont, at 99% or above. The states with the lowest percentages are Texas, California, Massachusetts, Louisiana, and Mississippi, at 97% or lower.
After looking at all four figures and using the application with different variables and date ranges, I was able to notice some common trends. States that had the highest number of deaths due to Covid-19 also had the highest number of positive Covid-19 cases. Similarly, states with the lowest number of deaths also had the fewest positive cases. In conclusion, the states with denser populations were affected the most and bore a huge loss of life, whereas the sparsely populated states did seem to struggle with Covid-19 as well.